Kenneth Tay
Oct 8, 2019
ggplot2
ggplot2 syntax
ggplot() +
  geom_violin(data = mtcars,
              mapping = aes(x = factor(cyl), y = hp)) +
  geom_jitter(data = mtcars,
              mapping = aes(x = factor(cyl), y = hp))

ggplot2 syntax
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = hp)) +
  geom_violin() +
  geom_jitter() +
  labs(title = "Horsepower vs. Cylinder", x = "Cylinder",
       y = "Horsepower")

ggplot2 syntax
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = hp)) +
  geom_violin() +
  geom_jitter() +
  labs(title = "Horsepower vs. Cylinder", x = "Cylinder",
       y = "Horsepower") +
  theme_classic()

dplyr (and %>% syntax)
We rarely get data in exactly the form we need!
Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).
dplyr verbs
select(): pick variables by their names
mutate(): create new variables based on existing ones
arrange(): reorder rows
filter(): pick observations by their values
summarize(): collapse many values down to a single summary

## Name Gender English Math Science History Spanish
## 1 Andrew M 60 96 80 56 77
## 2 John M 66 55 56 64 77
## 3 Mary F 92 63 70 62 98
## 4 Jane F 80 76 89 55 40
## 5 Bob M 80 80 82 48 50
## 6 Dan M 58 52 79 90 61

select: pick subset of variables/columns by name
History teacher: “I just want their names and History scores”
## Name History
## 1 Andrew 56
## 2 John 64
## 3 Mary 62
## 4 Jane 55
## 5 Bob 48
## 6 Dan 90
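
The call behind this output did not survive in the text. A minimal sketch of what it could look like, assuming the data frame is named scores and rebuilding only the columns needed here from the printed table (history_scores is an illustrative name):

```r
library(dplyr)

# Assumption: part of the scores table, retyped from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  Gender  = c("M", "M", "F", "F", "M", "M"),
  History = c(56, 64, 62, 55, 48, 90),
  stringsAsFactors = FALSE
)

# First argument: the data frame; remaining arguments: columns to keep
history_scores <- select(scores, Name, History)
print(history_scores)
```

select() keeps the rows as they are and returns only the named columns, in the order they are listed.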

mutate: create new columns based on old ones
Form teacher: “What are their total scores?”
## Name Gender English Math Science History Spanish Total
## 1 Andrew M 60 96 80 56 77 369
## 2 John M 66 55 56 64 77 318
## 3 Mary F 92 63 70 62 98 385
## 4 Jane F 80 76 89 55 40 340
## 5 Bob M 80 80 82 48 50 340
## 6 Dan M 58 52 79 90 61 340
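
One way the Total column might be computed, as a sketch: the scores data frame below is retyped from the printed table, and with_total is an illustrative name rather than one from the slides.

```r
library(dplyr)

# Assumption: the scores table, retyped from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  Gender  = c("M", "M", "F", "F", "M", "M"),
  English = c(60, 66, 92, 80, 80, 58),
  Math    = c(96, 55, 63, 76, 80, 52),
  Science = c(80, 56, 70, 89, 82, 79),
  History = c(56, 64, 62, 55, 48, 90),
  Spanish = c(77, 77, 98, 40, 50, 61),
  stringsAsFactors = FALSE
)

# New column on the left of =, expression built from existing columns on the right
with_total <- mutate(scores,
                     Total = English + Math + Science + History + Spanish)
print(with_total)
```

mutate() returns a new data frame; scores itself is unchanged unless the result is assigned back to it.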

arrange: reorder rows
Form teacher: “Can I have the students in order of overall performance?”
## Name Gender English Math Science History Spanish Total
## 1 John M 66 55 56 64 77 318
## 2 Jane F 80 76 89 55 40 340
## 3 Bob M 80 80 82 48 50 340
## 4 Dan M 58 52 79 90 61 340
## 5 Andrew M 60 96 80 56 77 369
## 6 Mary F 92 63 70 62 98 385

arrange: reorder rows
Form teacher: “No no, better students on top please…”
## Name Gender English Math Science History Spanish Total
## 1 Mary F 92 63 70 62 98 385
## 2 Andrew M 60 96 80 56 77 369
## 3 Jane F 80 76 89 55 40 340
## 4 Bob M 80 80 82 48 50 340
## 5 Dan M 58 52 79 90 61 340
## 6 John M 66 55 56 64 77 318

arrange: reorder rows
Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”
## Name Gender English Math Science History Spanish Total
## 1 Mary F 92 63 70 62 98 385
## 2 Andrew M 60 96 80 56 77 369
## 3 Bob M 80 80 82 48 50 340
## 4 Dan M 58 52 79 90 61 340
## 5 Jane F 80 76 89 55 40 340
## 6 John M 66 55 56 64 77 318
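
The three orderings above could be produced along these lines. This is a sketch: the data frame keeps only Name and Total (values copied from the printed output), and the result names are illustrative.

```r
library(dplyr)

# Assumption: names and totals copied from the printed output
scores <- data.frame(
  Name  = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  Total = c(369, 318, 385, 340, 340, 340),
  stringsAsFactors = FALSE
)

# Ascending order of Total (the default)
by_total <- arrange(scores, Total)

# Descending order: wrap the column in desc()
by_total_desc <- arrange(scores, desc(Total))

# Ties in Total are broken by the next column listed, here Name
by_total_then_name <- arrange(scores, desc(Total), Name)
print(by_total_then_name)
```

Each extra column passed to arrange() is only consulted when all earlier columns tie.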

filter: pick observations by their values
History teacher: “I want to see which students scored less than 60 for history”
## Name Gender English Math Science History Spanish Total
## 1 Andrew M 60 96 80 56 77 369
## 2 Jane F 80 76 89 55 40 340
## 3 Bob M 80 80 82 48 50 340
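
A sketch of a filter() call that would select these rows, assuming the data frame is named scores; only the two columns needed are rebuilt here, and low_history is an illustrative name.

```r
library(dplyr)

# Assumption: names and History scores copied from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  History = c(56, 64, 62, 55, 48, 90),
  stringsAsFactors = FALSE
)

# Keep only the rows where the condition is TRUE
low_history <- filter(scores, History < 60)
print(low_history)  # rows for Andrew, Jane and Bob
```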

Other ways to make comparisons:
>: greater than
<: less than
>=: greater than or equal to
<=: less than or equal to
!=: not equal to
==: equal to (Do not use = to test for equality!!)

Combining comparisons:
!: not
&: and
|: or

filter examples
Dan’s parents: “I just want Dan’s scores”
## Name Gender English Math Science History Spanish Total
## 1 Dan M 58 52 79 90 61 340
Language teacher: “I want to know which students scored < 50 for either English or Spanish”
## Name Gender English Math Science History Spanish Total
## 1 Jane F 80 76 89 55 40 340
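
Both requests can be written with the comparison operators listed earlier. A sketch, assuming a scores data frame with the columns shown (retyped from the printed table; result names are illustrative):

```r
library(dplyr)

# Assumption: relevant columns copied from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  English = c(60, 66, 92, 80, 80, 58),
  Spanish = c(77, 77, 98, 40, 50, 61),
  stringsAsFactors = FALSE
)

# Equality test uses ==, not =
dan_only <- filter(scores, Name == "Dan")

# "Either ... or" translates to |
weak_language <- filter(scores, English < 50 | Spanish < 50)
print(weak_language)  # only Jane (Spanish 40)
```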

summarize: get summaries of data
Academic: “I want to know the correlation between math and science scores”
## corr
## 1 0.5470561

summarize: get summaries of data
Science teacher: “I want to know the mean and standard deviation of the scores for science”
## Science_mean Science_sd
## 1 76 11.54123
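
The two summaries above could be computed as follows. This is a sketch: the column names corr, Science_mean and Science_sd come from the printed output, while the data frame is retyped from the table and the result names are illustrative.

```r
library(dplyr)

# Assumption: Math and Science scores copied from the printed output
scores <- data.frame(
  Math    = c(96, 55, 63, 76, 80, 52),
  Science = c(80, 56, 70, 89, 82, 79)
)

# One summary column: the whole table collapses to a single row
math_sci_corr <- summarize(scores, corr = cor(Math, Science))

# Several summaries can be computed in one call
science_stats <- summarize(scores,
                           Science_mean = mean(Science),
                           Science_sd   = sd(Science))
print(math_sci_corr)
print(science_stats)
```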

dplyr commands using %>%
Science teacher: “I want to know which students scored > 80 for Science, but I just want names”
## Name
## 1 Jane
## 2 Bob
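
A sketch of how the two steps (filter, then select) might be chained with %>%, assuming a scores data frame retyped from the printed table; top_science is an illustrative name.

```r
library(dplyr)

# Assumption: names and Science scores copied from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  Science = c(80, 56, 70, 89, 82, 79),
  stringsAsFactors = FALSE
)

# Read top to bottom: take scores, keep rows with Science > 80, keep only Name
top_science <- scores %>%
  filter(Science > 80) %>%
  select(Name)
print(top_science)  # Jane and Bob
```

The same result without the pipe is select(filter(scores, Science > 80), Name), which has to be read inside out.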

group_by: use dplyr verbs on a group-by-group basis
Academic: “I want to know if the boys scored better than the girls in Spanish”
## # A tibble: 2 x 2
## Gender Spanish_mean
## <chr> <dbl>
## 1 F 69
## 2 M 66.2
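
A grouped summary of this shape could be written as below. This is a sketch: Spanish_mean is the column name from the printed output, the data frame is retyped from the table, and by_gender is an illustrative name.

```r
library(dplyr)

# Assumption: Gender and Spanish scores copied from the printed output
scores <- data.frame(
  Gender  = c("M", "M", "F", "F", "M", "M"),
  Spanish = c(77, 77, 98, 40, 50, 61),
  stringsAsFactors = FALSE
)

# group_by() marks the groups; summarize() then returns one row per group
by_gender <- scores %>%
  group_by(Gender) %>%
  summarize(Spanish_mean = mean(Spanish))
print(by_gender)
```

The mean for M is 66.25; the tibble printout rounds it to 66.2 for display.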

Language teacher: “I want to know which students scored < 70 for Spanish”
## Name Gender English Math Science History Spanish Total
## 1 Jane F 80 76 89 55 40 340
## 2 Bob M 80 80 82 48 50 340
## 3 Dan M 58 52 79 90 61 340

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”
## Name
## 1 Dan
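
One possible pipeline for this request, as a sketch under the assumption that the data frame is named scores (columns retyped from the printed table; weak_both is an illustrative name):

```r
library(dplyr)

# Assumption: relevant columns copied from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  English = c(60, 66, 92, 80, 80, 58),
  Spanish = c(77, 77, 98, 40, 50, 61),
  stringsAsFactors = FALSE
)

# "Both" translates to &: each condition must hold for a row to be kept
weak_both <- scores %>%
  filter(English < 70 & Spanish < 70) %>%
  select(Name)
print(weak_both)  # only Dan
```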

Math teacher: “I want to know the lowest Math score for each Gender”
## # A tibble: 2 x 2
## Gender min_math
## <chr> <dbl>
## 1 F 63
## 2 M 52
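
A sketch of a grouped minimum that would give this table. The column name min_math is taken from the printed output; the data frame is retyped from the table and lowest_math is an illustrative name.

```r
library(dplyr)

# Assumption: Gender and Math scores copied from the printed output
scores <- data.frame(
  Gender = c("M", "M", "F", "F", "M", "M"),
  Math   = c(96, 55, 63, 76, 80, 52),
  stringsAsFactors = FALSE
)

# One row per Gender, holding the smallest Math score in that group
lowest_math <- scores %>%
  group_by(Gender) %>%
  summarize(min_math = min(Math))
print(lowest_math)
```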

History teacher: “I want the names of students with their history scores, with the entries sorted by name”
## Name History
## 1 Andrew 56
## 2 Bob 48
## 3 Dan 90
## 4 Jane 55
## 5 John 64
## 6 Mary 62
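
This request combines select() and arrange(). A minimal sketch, assuming the data frame is named scores (columns retyped from the printed table; history_by_name is an illustrative name):

```r
library(dplyr)

# Assumption: relevant columns copied from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  Gender  = c("M", "M", "F", "F", "M", "M"),
  History = c(56, 64, 62, 55, 48, 90),
  stringsAsFactors = FALSE
)

# Keep the two columns, then sort the rows alphabetically by Name
history_by_name <- scores %>%
  select(Name, History) %>%
  arrange(Name)
print(history_by_name)
```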
Optional material
transmute: create new columns based on old ones, discard old ones
Form teacher: “I just want the mean score for each student”
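
A sketch of how transmute() might answer this, assuming the scores table retyped from the printed output. The column name Mean and the result name mean_scores are hypothetical, since the original call and its output are not shown.

```r
library(dplyr)

# Assumption: the scores table, retyped from the printed output
scores <- data.frame(
  Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
  English = c(60, 66, 92, 80, 80, 58),
  Math    = c(96, 55, 63, 76, 80, 52),
  Science = c(80, 56, 70, 89, 82, 79),
  History = c(56, 64, 62, 55, 48, 90),
  Spanish = c(77, 77, 98, 40, 50, 61),
  stringsAsFactors = FALSE
)

# Only the columns named in the call are kept: Name and the new Mean column
mean_scores <- transmute(scores,
                         Name,
                         Mean = (English + Math + Science + History + Spanish) / 5)
print(mean_scores)
```

Recent dplyr releases mark transmute() as superseded in favour of mutate(..., .keep = "none"), but the function still works as shown.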

How does R understand the code filter(History < 60)?
For each row: is History less than 60 or not?
History < 60 is a statement that is either TRUE or FALSE
If TRUE, keep the row
filter(<condition>) only returns the rows for which <condition> is TRUE
A statement that is either TRUE or FALSE: boolean expression
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] FALSE TRUE FALSE FALSE
## [1] TRUE FALSE FALSE TRUE
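
To see the row-by-row test that filter() relies on, the comparison can be run directly on a vector. The sketch below uses the History scores from the table; the variable names are illustrative, and these lines are not the expressions that produced the outputs above.

```r
# Assumption: History scores copied from the printed table
history <- c(56, 64, 62, 55, 48, 90)

# A comparison on a vector gives one TRUE/FALSE per element
keep <- history < 60
print(keep)

# Boolean vectors combine element-wise with !, & and |
below_60_or_top <- history < 60 | history >= 90
not_below_60    <- !keep

# filter(History < 60) keeps exactly the rows where keep is TRUE
print(history[keep])
```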

%>%
%>% is implemented by the magrittr package
When the dplyr package is loaded, %>% from magrittr becomes available too
%>% is “syntactic sugar”: makes code easier to understand
The value on the left of %>% becomes the first argument in the function on the right of %>%
head(mtcars, n = 6) is equivalent to mtcars %>% head(n = 6)
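
The equivalence of the two forms can be checked directly. A short sketch using the built-in mtcars data (the variable names are illustrative):

```r
library(dplyr)  # makes %>% available

# Ordinary call: the data frame is the first argument
plain_call <- head(mtcars, n = 6)

# Piped call: mtcars is inserted as the first argument of head()
piped_call <- mtcars %>% head(n = 6)

# Both forms return the same six rows
print(identical(plain_call, piped_call))
```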